{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3DNHhOsQX4Rg"
      },
      "source": [
        "##### Copyright 2024 Google LLC."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "cellView": "form",
        "id": "_bO8_SJzX4t6"
      },
      "outputs": [],
      "source": [
        "# @title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ABzryOPIYE2O"
      },
      "source": [
        "# Getting Started with Constrained generation with Gemma 2 using Llamacpp and Guidance\n",
        "\n",
        "[Gemma](https://ai.google.dev/gemma) is a family of lightweight, state-of-the-art open-source language models from Google. Built from the same research and technology used to create the Gemini models, Gemma models are text-to-text, decoder-only large language models (LLMs), available in English, with open weights, pre-trained variants, and instruction-tuned variants.\n",
        "Gemma models are well-suited for various text-generation tasks, including question-answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop, or cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.\n",
        "\n",
        "Constrained generation is a method that modifies the token generation process of a generative model to limit its predictions for subsequent tokens to only those that adhere to the necessary output structure.\n",
        "\n",
        "[llama.cpp](https://github.com/ggerganov/llama.cpp) is a C++ implementation of Meta AI's LLaMA and other large language model architectures, designed for efficient performance on local machines or within environments like Google Colab. It enables you to run large language models without needing extensive computational resources. In llama.cpp, formal grammars are defined using the GBNF (GGML BNF) format to constrain model outputs. It can be used, for instance, to make the model produce legitimate JSON or to communicate exclusively in emojis.\n",
        "\n",
        "[Guidance](https://github.com/guidance-ai/guidance/tree/main?tab=readme-ov-file#constrained-generation) is an effective programming paradigm for steering language models. Guidance reduces latency and costs compared to traditional prompting or fine-tuning while allowing you to control the output's structure and provide high-quality output for your use case.\n",
        "\n",
        "In this notebook, you will learn how to perform constrained generation in Gemma 2 models using `llama.cpp` and `guidance` in a Google Colab environment. You'll install the necessary packages, set up the model, and run a sample prompt.\n",
        "\n",
        "<table align=\"left\">\n",
        "<td>\n",
        " <a target=\"_blank\" href=\"https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/[Gemma_2]Constrained_generation.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
        "</td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6TDJ3j0rh-Uw"
      },
      "source": [
        "## Setup\n",
        "\n",
        "### Select the Colab runtime\n",
        "To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:\n",
        "\n",
        "1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.\n",
        "2. Select **Change runtime type**.\n",
        "3. Under **Hardware accelerator**, select **T4 GPU**.\n",
        "\n",
        "### Gemma setup\n",
        "\n",
        "**Before you dive into the tutorial, let's get you set up with Gemma:**\n",
        "\n",
        "1. **Hugging Face Account:**  If you don't already have one, you can create a free Hugging Face account by clicking [here](https://huggingface.co/join).\n",
        "2. **Gemma Model Access:** Head over to the [Gemma model page](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315) and accept the usage conditions.\n",
        "3. **Colab with Gemma Power:**  For this tutorial, you'll need a Colab runtime with enough resources to handle the Gemma 2B model. Choose an appropriate runtime when starting your Colab session.\n",
        "4. **Hugging Face Token:**  Generate a Hugging Face access (preferably `write` permission) token by clicking [here](https://huggingface.co/settings/tokens). You'll need this token later in the tutorial.\n",
        "\n",
        "**Once you've completed these steps, you're ready to move on to the next section where you'll set up environment variables in your Colab environment.**\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7tvumI5CiEeG"
      },
      "source": [
        "### Configure your HF token\n",
        "\n",
        "Add your Hugging Face token to the Colab Secrets manager to securely store it.\n",
        "\n",
        "1. Open your Google Colab notebook and click on the 🔑 Secrets tab in the left panel. <img src=\"https://storage.googleapis.com/generativeai-downloads/images/secrets.jpg\" alt=\"The Secrets tab is found on the left panel.\" width=50%>\n",
        "2. Create a new secret with the name `HF_TOKEN`.\n",
        "3. Copy/paste your token key into the Value input box of `HF_TOKEN`.\n",
        "4. Toggle the button on the left to allow notebook access to the secret."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "lpXvNz0HeKc1"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "from google.colab import userdata\n",
        "\n",
        "# Note: `userdata.get` is a Colab API. If you're not using Colab, set the env\n",
        "# vars as appropriate for your system.\n",
        "os.environ[\"HF_TOKEN\"] = userdata.get(\"HF_TOKEN\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fxfWSVsaiFQK"
      },
      "source": [
        "### Install dependencies\n",
        "\n",
        "You'll need to install a few Python packages and dependencies to interact with HuggingFace along with `llama-cpp-python` and `guidance`. Find some of the releases of `llama-cpp-python` supporting CUDA 12.2 [here](https://abetlen.github.io/llama-cpp-python/whl/cu122/llama-cpp-python/).\n",
        "\n",
        "Run the following cell to install or upgrade it:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "Xm9wxRQ_eUkC"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\u001b[?25l   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/447.5 kB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\u001b[2K   \u001b[91m━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m143.4/447.5 kB\u001b[0m \u001b[31m8.8 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r\u001b[2K   \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m \u001b[32m440.3/447.5 kB\u001b[0m \u001b[31m11.1 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m447.5/447.5 kB\u001b[0m \u001b[31m6.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting guidance\n",
            "  Downloading guidance-0.1.16-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.7 kB)\n",
            "Collecting diskcache (from guidance)\n",
            "  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)\n",
            "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from guidance) (1.26.4)\n",
            "Collecting ordered-set (from guidance)\n",
            "  Downloading ordered_set-4.1.0-py3-none-any.whl.metadata (5.3 kB)\n",
            "Requirement already satisfied: platformdirs in /usr/local/lib/python3.10/dist-packages (from guidance) (4.3.6)\n",
            "Requirement already satisfied: protobuf in /usr/local/lib/python3.10/dist-packages (from guidance) (3.20.3)\n",
            "Requirement already satisfied: pydantic in /usr/local/lib/python3.10/dist-packages (from guidance) (2.9.2)\n",
            "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from guidance) (2.32.3)\n",
            "Collecting tiktoken>=0.3 (from guidance)\n",
            "  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)\n",
            "Requirement already satisfied: regex>=2022.1.18 in /usr/local/lib/python3.10/dist-packages (from tiktoken>=0.3->guidance) (2024.9.11)\n",
            "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->guidance) (3.4.0)\n",
            "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->guidance) (3.10)\n",
            "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->guidance) (2.2.3)\n",
            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->guidance) (2024.8.30)\n",
            "Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.10/dist-packages (from pydantic->guidance) (0.7.0)\n",
            "Requirement already satisfied: pydantic-core==2.23.4 in /usr/local/lib/python3.10/dist-packages (from pydantic->guidance) (2.23.4)\n",
            "Requirement already satisfied: typing-extensions>=4.6.1 in /usr/local/lib/python3.10/dist-packages (from pydantic->guidance) (4.12.2)\n",
            "Downloading guidance-0.1.16-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (255 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m255.4/255.4 kB\u001b[0m \u001b[31m8.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.2/1.2 MB\u001b[0m \u001b[31m37.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading diskcache-5.6.3-py3-none-any.whl (45 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m45.5/45.5 kB\u001b[0m \u001b[31m3.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading ordered_set-4.1.0-py3-none-any.whl (7.6 kB)\n",
            "Installing collected packages: ordered-set, diskcache, tiktoken, guidance\n",
            "Successfully installed diskcache-5.6.3 guidance-0.1.16 ordered-set-4.1.0 tiktoken-0.8.0\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m443.8/443.8 MB\u001b[0m \u001b[31m4.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25h"
          ]
        }
      ],
      "source": [
        "# The huggingface_hub library allows us to download models and other files from Hugging Face.\n",
        "!pip install --upgrade -q huggingface_hub\n",
        "\n",
        "# Install the guidance package.\n",
        "!pip install guidance\n",
        "\n",
        "# The llama-cpp-python library allows us to leverage GPUs.\n",
        "!pip install llama-cpp-python==0.2.90 \\\n",
        "  -q -U --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "CtCpooOBipDy"
      },
      "source": [
        "### Logging into Hugging Face Hub\n",
        "\n",
        "Next, you’ll need to log into the Hugging Face Hub using your access token to download the Gemma model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "id": "CzMQZ1SReagV"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.\n",
            "WARNING:huggingface_hub._login:Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.\n"
          ]
        }
      ],
      "source": [
        "from huggingface_hub import login\n",
        "\n",
        "login(os.environ[\"HF_TOKEN\"])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8ZjH8VRZismE"
      },
      "source": [
        "### Downloading the Gemma 2 Model\n",
        "Once you're logged in, you can download the Gemma 2 model files from Hugging Face. The [Gemma 2 model](https://huggingface.co/google/gemma-2-2b-GGUF) is available in **GGUF** format, which is optimized for use with `llama.cpp` and compatible tools like Llamafile."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "id": "K2QM6BO3ebUY"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "15433f2b900b4034ae81f61a5780efd5",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "2b_pt_v2.gguf:   0%|          | 0.00/10.5G [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'2b_pt_v2.gguf'"
            ]
          },
          "execution_count": 5,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from huggingface_hub import hf_hub_download\n",
        "\n",
        "# Specify the repository and filename\n",
        "repo_id = 'google/gemma-2-2b-GGUF'  # Repository containing the GGUF model\n",
        "filename = '2b_pt_v2.gguf'  # The GGUF model file\n",
        "\n",
        "# Download the model file to the current directory\n",
        "hf_hub_download(repo_id=repo_id, filename=filename, local_dir='.')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "I28udx9hjKFO"
      },
      "source": [
        "## Constrained generation in Gemma 2 model using `llama.cpp`\n",
        "\n",
        "An advanced way to perform constrained generation is to use Context-free grammar (CFG) to direct your LLM to produce the desired structure.\n",
        "\n",
        "Context-free grammar (CFG) can be considered a more powerful and expressive regex form. CFGs are capable of managing chores like balancing parenthesis and complex structures like nested and recursive structures.\n",
        "\n",
        "**llama.cpp** supports CFG through a format called GBNF (GGML Backus-Naur Form). In short, GBNF defines formal grammars that limit the outputs of models in `llama.cpp`. You can read more about GBNF and its syntax from [llama.cpp's README page](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md).\n",
        "\n",
        "You can use the `LlamaGrammer` function from llama_cpp to read the grammar as a string or read the grammar from a GBNF File.\n",
        "\n",
        "For this example, you will create a GBNF grammar to show football (soccer) player statistics as JSON. Here, you are defining the GBNF grammar directly as a string within your code."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "id": "3EKu0HmQd6pn"
      },
      "outputs": [],
      "source": [
        "from llama_cpp.llama import LlamaGrammar\n",
        "\n",
        "# Define the GBNF grammar as a string.\n",
        "FOOTBALL_GBNF = r\"\"\"\n",
        "ws ::= ([ \\t\\n] ws)?\n",
        "\n",
        "string ::=\n",
        "  \"\\\"\" (\n",
        "    [^\"\\\\\\x7F\\x00-\\x1F] |\n",
        "     \"\\\\\" ([\"\\\\/bfnrt] | \"u\" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])\n",
        "  )* \"\\\"\"\n",
        "\n",
        "\n",
        "digit ::= \"0\" | \"1\" | \"2\" | \"3\" | \"4\" | \"5\" | \"6\" | \"7\" | \"8\" | \"9\"\n",
        "\n",
        "one-or-two-digits ::= digit | digit digit\n",
        "\n",
        "one-or-four-digits ::= digit | digit digit | digit digit digit | digit digit digit digit\n",
        "\n",
        "zero-to-two-digits ::= \"\" | digit | digit digit\n",
        "\n",
        "position ::= \"\\\"Striker\\\"\" | \"\\\"Midfielder\\\"\" | \"\\\"Defender\\\"\" | \"\\\"Goalkeeper\\\"\"\n",
        "\n",
        "world-cup ::= \"\\\"Yes\\\"\" | \"\\\"No\\\"\"\n",
        "\n",
        "stats ::= (\n",
        "  \"{\\n\" ws\n",
        "    \"\\\"goals\\\": \" one-or-four-digits \",\" ws\n",
        "    \"\\\"assists\\\": \" one-or-four-digits \",\" ws\n",
        "    \"\\\"height\\\": \" one-or-two-digits \".\" zero-to-two-digits \",\" ws\n",
        "    \"\\\"world-cup\\\": \" world-cup ws\n",
        "  \"}\"\n",
        ")\n",
        "\n",
        "player ::= (\n",
        "  \"{\\n\" ws\n",
        "    \"\\\"name\\\": \" string \",\" ws\n",
        "    \"\\\"country\\\": \" string \",\" ws\n",
        "    \"\\\"position\\\": \" position \",\" ws\n",
        "    \"\\\"stats\\\": \" stats ws\n",
        "  \"}\"\n",
        ")\n",
        "\n",
        "root ::= player\n",
        "\"\"\"\n",
        "\n",
        "# Read the GBNF grammar using `from_string` method of LlamaGrammar.\n",
        "grammar = LlamaGrammar.from_string(FOOTBALL_GBNF, verbose=False)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "btMUB57L7L5o"
      },
      "source": [
        "You will initialize the model using the `llama-cpp-python` library by loading a pre-trained Gemma 2 model from HuggingFace. Here's what each part of the code does:\n",
        "\n",
        "- `model_path`: Path to the model.\n",
        "- `verbose`: Disables verbose logging during model loading for cleaner output.\n",
        "- `n_gpu_layers`: Configures GPU acceleration. A value of `-1` means it will use as many GPU layers as possible.\n",
        "\n",
        "To perform constrained generation, pass the `grammar` defined above as an argument of the `create_chat_completion` function."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "id": "5A3ETNJ5elMG"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{\n",
            "\"name\": \"Lionel Messi\",\n",
            "\"country\": \"Argentina\",\n",
            "\"position\": \"Midfielder\",\n",
            "\"stats\": {\n",
            "\"goals\": 60,\n",
            "\"assists\": 80,\n",
            "\"height\": 1.73,\n",
            "\"world-cup\": \"Yes\"\n",
            "}\n",
            "}\n"
          ]
        }
      ],
      "source": [
        "model_path = \"2b_pt_v2.gguf\"\n",
        "from llama_cpp.llama import Llama\n",
        "\n",
        "llm = Llama(\n",
        "    model_path=model_path,\n",
        "    n_gpu_layers=-1,\n",
        "    verbose=False,\n",
        ")\n",
        "\n",
        "# Generate response\n",
        "output = llm.create_chat_completion(\n",
        "    messages=[\n",
        "        {\n",
        "            \"role\": \"user\",\n",
        "            \"content\": \"Using JSON, describe the following Football player: \"\n",
        "            + \"Lionel Messi\",\n",
        "        },\n",
        "    ],\n",
        "    grammar=grammar     # Pass the grammar defined earlier\n",
        ")\n",
        "\n",
        "print(output[\"choices\"][0][\"message\"][\"content\"])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YnLeIpe2AkRb"
      },
      "source": [
        "## Constrained generation in Gemma 2 model using `Guidance`\n",
        "\n",
        "Guidance supports context-free grammar via a purely Pythonic interface. In this example, you will use CFG along with interleaved generative constructs and regex.\n",
        "\n",
        "Interleaved generative structure specifies your structured output as generative constructs and static strings that alternate. Here the generative parts of the task can be defined individually, which will help you to maintain the output structure.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cky8DH22bZiV"
      },
      "source": [
        "In this example, you will use different operators provided by guidance to implement CFG, such as `select` and `zero_or_more`.\n",
        "\n",
        "`select`: Constrains generation to a set of options.\n",
        "\n",
        "`zero_or_more`: Content repeated zero or more times.\n",
        "\n",
        "To use regex for constrained generation, you can use the `gen` operator. Specify the regex in the `regex` argument, `gen(regex='...)`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "id": "yaMy-QbXsWDE"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "WARNING:guidance.models.llama_cpp._llama_cpp:Cannot use verbose=True in this context (probably CoLab). See https://github.com/abetlen/llama-cpp-python/issues/729\n",
            "llama_model_loader: loaded meta data with 29 key-value pairs and 288 tensors from 2b_pt_v2.gguf (version GGUF V3 (latest))\n",
            "llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.\n",
            "llama_model_loader: - kv   0:                       general.architecture str              = gemma2\n",
            "llama_model_loader: - kv   1:                               general.type str              = model\n",
            "llama_model_loader: - kv   2:                               general.name str              = ff8948d2ca54b23c93d253533c6effcf2e892347\n",
            "llama_model_loader: - kv   3:                      gemma2.context_length u32              = 8192\n",
            "llama_model_loader: - kv   4:                    gemma2.embedding_length u32              = 2304\n",
            "llama_model_loader: - kv   5:                         gemma2.block_count u32              = 26\n",
            "llama_model_loader: - kv   6:                 gemma2.feed_forward_length u32              = 9216\n",
            "llama_model_loader: - kv   7:                gemma2.attention.head_count u32              = 8\n",
            "llama_model_loader: - kv   8:             gemma2.attention.head_count_kv u32              = 4\n",
            "llama_model_loader: - kv   9:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001\n",
            "llama_model_loader: - kv  10:                gemma2.attention.key_length u32              = 256\n",
            "llama_model_loader: - kv  11:              gemma2.attention.value_length u32              = 256\n",
            "llama_model_loader: - kv  12:                          general.file_type u32              = 0\n",
            "llama_model_loader: - kv  13:              gemma2.attn_logit_softcapping f32              = 50.000000\n",
            "llama_model_loader: - kv  14:             gemma2.final_logit_softcapping f32              = 30.000000\n",
            "llama_model_loader: - kv  15:            gemma2.attention.sliding_window u32              = 4096\n",
            "llama_model_loader: - kv  16:                       tokenizer.ggml.model str              = llama\n",
            "llama_model_loader: - kv  17:                         tokenizer.ggml.pre str              = default\n",
            "llama_model_loader: - kv  18:                      tokenizer.ggml.tokens arr[str,256000]  = [\"<pad>\", \"<eos>\", \"<bos>\", \"<unk>\", ...\n",
            "llama_model_loader: - kv  19:                      tokenizer.ggml.scores arr[f32,256000]  = [-1000.000000, -1000.000000, -1000.00...\n",
            "llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...\n",
            "llama_model_loader: - kv  21:                tokenizer.ggml.bos_token_id u32              = 2\n",
            "llama_model_loader: - kv  22:                tokenizer.ggml.eos_token_id u32              = 1\n",
            "llama_model_loader: - kv  23:            tokenizer.ggml.unknown_token_id u32              = 3\n",
            "llama_model_loader: - kv  24:            tokenizer.ggml.padding_token_id u32              = 0\n",
            "llama_model_loader: - kv  25:               tokenizer.ggml.add_bos_token bool             = true\n",
            "llama_model_loader: - kv  26:               tokenizer.ggml.add_eos_token bool             = false\n",
            "llama_model_loader: - kv  27:            tokenizer.ggml.add_space_prefix bool             = false\n",
            "llama_model_loader: - kv  28:               general.quantization_version u32              = 2\n",
            "llama_model_loader: - type  f32:  288 tensors\n",
            "llm_load_vocab: special tokens cache size = 249\n",
            "llm_load_vocab: token to piece cache size = 1.6014 MB\n",
            "llm_load_print_meta: format           = GGUF V3 (latest)\n",
            "llm_load_print_meta: arch             = gemma2\n",
            "llm_load_print_meta: vocab type       = SPM\n",
            "llm_load_print_meta: n_vocab          = 256000\n",
            "llm_load_print_meta: n_merges         = 0\n",
            "llm_load_print_meta: vocab_only       = 0\n",
            "llm_load_print_meta: n_ctx_train      = 8192\n",
            "llm_load_print_meta: n_embd           = 2304\n",
            "llm_load_print_meta: n_layer          = 26\n",
            "llm_load_print_meta: n_head           = 8\n",
            "llm_load_print_meta: n_head_kv        = 4\n",
            "llm_load_print_meta: n_rot            = 256\n",
            "llm_load_print_meta: n_swa            = 4096\n",
            "llm_load_print_meta: n_embd_head_k    = 256\n",
            "llm_load_print_meta: n_embd_head_v    = 256\n",
            "llm_load_print_meta: n_gqa            = 2\n",
            "llm_load_print_meta: n_embd_k_gqa     = 1024\n",
            "llm_load_print_meta: n_embd_v_gqa     = 1024\n",
            "llm_load_print_meta: f_norm_eps       = 0.0e+00\n",
            "llm_load_print_meta: f_norm_rms_eps   = 1.0e-06\n",
            "llm_load_print_meta: f_clamp_kqv      = 0.0e+00\n",
            "llm_load_print_meta: f_max_alibi_bias = 0.0e+00\n",
            "llm_load_print_meta: f_logit_scale    = 0.0e+00\n",
            "llm_load_print_meta: n_ff             = 9216\n",
            "llm_load_print_meta: n_expert         = 0\n",
            "llm_load_print_meta: n_expert_used    = 0\n",
            "llm_load_print_meta: causal attn      = 1\n",
            "llm_load_print_meta: pooling type     = 0\n",
            "llm_load_print_meta: rope type        = 2\n",
            "llm_load_print_meta: rope scaling     = linear\n",
            "llm_load_print_meta: freq_base_train  = 10000.0\n",
            "llm_load_print_meta: freq_scale_train = 1\n",
            "llm_load_print_meta: n_ctx_orig_yarn  = 8192\n",
            "llm_load_print_meta: rope_finetuned   = unknown\n",
            "llm_load_print_meta: ssm_d_conv       = 0\n",
            "llm_load_print_meta: ssm_d_inner      = 0\n",
            "llm_load_print_meta: ssm_d_state      = 0\n",
            "llm_load_print_meta: ssm_dt_rank      = 0\n",
            "llm_load_print_meta: ssm_dt_b_c_rms   = 0\n",
            "llm_load_print_meta: model type       = 2B\n",
            "llm_load_print_meta: model ftype      = all F32\n",
            "llm_load_print_meta: model params     = 2.61 B\n",
            "llm_load_print_meta: model size       = 9.74 GiB (32.00 BPW) \n",
            "llm_load_print_meta: general.name     = ff8948d2ca54b23c93d253533c6effcf2e892347\n",
            "llm_load_print_meta: BOS token        = 2 '<bos>'\n",
            "llm_load_print_meta: EOS token        = 1 '<eos>'\n",
            "llm_load_print_meta: UNK token        = 3 '<unk>'\n",
            "llm_load_print_meta: PAD token        = 0 '<pad>'\n",
            "llm_load_print_meta: LF token         = 227 '<0x0A>'\n",
            "llm_load_print_meta: EOT token        = 107 '<end_of_turn>'\n",
            "llm_load_print_meta: max token length = 48\n",
            "llm_load_tensors: ggml ctx size =    0.13 MiB\n",
            "llm_load_tensors: offloading 0 repeating layers to GPU\n",
            "llm_load_tensors: offloaded 0/27 layers to GPU\n",
            "llm_load_tensors:        CPU buffer size =  9972.92 MiB\n",
            "..................................................................\n",
            "llama_new_context_with_model: n_ctx      = 512\n",
            "llama_new_context_with_model: n_batch    = 512\n",
            "llama_new_context_with_model: n_ubatch   = 512\n",
            "llama_new_context_with_model: flash_attn = 0\n",
            "llama_new_context_with_model: freq_base  = 10000.0\n",
            "llama_new_context_with_model: freq_scale = 1\n",
            "llama_kv_cache_init:  CUDA_Host KV buffer size =    52.00 MiB\n",
            "llama_new_context_with_model: KV self size  =   52.00 MiB, K (f16):   26.00 MiB, V (f16):   26.00 MiB\n",
            "llama_new_context_with_model:  CUDA_Host  output buffer size =     0.98 MiB\n",
            "llama_new_context_with_model:      CUDA0 compute buffer size =  2754.50 MiB\n",
            "llama_new_context_with_model:  CUDA_Host compute buffer size =     6.51 MiB\n",
            "llama_new_context_with_model: graph nodes  = 1050\n",
            "llama_new_context_with_model: graph splits = 342\n",
            "AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | \n",
            "Model metadata: {'tokenizer.ggml.add_bos_token': 'true', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.unknown_token_id': '3', 'tokenizer.ggml.bos_token_id': '2', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'tokenizer.ggml.add_space_prefix': 'false', 'tokenizer.ggml.add_eos_token': 'false', 'gemma2.final_logit_softcapping': '30.000000', 'gemma2.attn_logit_softcapping': '50.000000', 'general.architecture': 'gemma2', 'gemma2.context_length': '8192', 'gemma2.attention.head_count_kv': '4', 'gemma2.attention.layer_norm_rms_epsilon': '0.000001', 'general.type': 'model', 'tokenizer.ggml.eos_token_id': '1', 'gemma2.embedding_length': '2304', 'tokenizer.ggml.pre': 'default', 'general.name': 'ff8948d2ca54b23c93d253533c6effcf2e892347', 'gemma2.block_count': '26', 'gemma2.feed_forward_length': '9216', 'gemma2.attention.key_length': '256', 'gemma2.attention.head_count': '8', 'gemma2.attention.sliding_window': '4096', 'gemma2.attention.value_length': '256', 'general.file_type': '0'}\n",
            "Using fallback chat format: llama-2\n"
          ]
        }
      ],
      "source": [
        "import guidance\n",
        "import numpy as np\n",
        "from guidance import models, gen, block, optional, select, zero_or_more\n",
        "from guidance import commit_point\n",
        "\n",
        "# Load the model\n",
        "model_path = \"2b_pt_v2.gguf\"\n",
        "gemma2 = models.LlamaCpp(model_path)\n",
        "\n",
        "\n",
        "# Custom generation function to repeat the content up to two. Similar to\n",
        "# one_or_more but there is a max value here.\n",
        "@guidance(stateless=True)\n",
        "def repeat_range(lm, content, min_count=1, max_count=2):\n",
        "    for _ in range(min_count):\n",
        "        lm += content\n",
        "    if max_count == np.inf:\n",
        "        lm += zero_or_more(content)\n",
        "    else:\n",
        "        for _ in range(max_count - min_count):\n",
        "            lm += optional(content)\n",
        "    return lm\n",
        "\n",
        "# Function to generate numbers up to two digits.\n",
        "@guidance(stateless=True)\n",
        "def number(lm):\n",
        "    n = repeat_range(select(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']))\n",
        "    # Allow for negative or positive numbers\n",
        "    return lm + select(['-' + n, n])\n",
        "\n",
        "# Function to select player position.\n",
        "@guidance(stateless=True)\n",
        "def position(lm):\n",
        "    return lm + select([\"Striker\", \"Midfielder\", \"Defender\", \"Goalkeeper\"])\n",
        "\n",
        "# Function to select whether the player has won a World cup.\n",
        "@guidance(stateless=True)\n",
        "def world_cup(lm):\n",
        "    return lm + select([\"Yes\", \"No\"])\n",
        "\n",
        "# Regex function for string.\n",
        "@guidance(stateless=True)\n",
        "def string_exp(lm):\n",
        "    return lm + gen(regex='([^\\\\\\\\]*|\\\\\\\\[\\\\\\\\bfnrt\\/]|\\\\\\\\u[0-7a-z])*')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1YL3DdFi762J"
      },
      "source": [
        "For this example, you will implement a combination of interleaved generative structure and CFG to show the stats of football(soccer) players as JSON.\n",
        "\n",
        "Here, you will keep the structure and keys of the JSON static, allowing the language model to fill in the value parts. This maintains the overall structure of the output."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {
        "id": "f39A6-Aa73FE"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<pre style='margin: 0px; padding: 0px; vertical-align: middle; padding-left: 8px; margin-left: -8px; border-radius: 0px; border-left: 1px solid rgba(127, 127, 127, 0.2); white-space: pre-wrap; font-family: ColfaxAI, Arial; font-size: 15px; line-height: 23px;'>Using JSON, describe these Football players:\n",
              "Lionel Messi\n",
              "{\n",
              "&quot;name&quot;:<span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'> &quot;</span><span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'>Lionel</span><span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'> Messi</span><span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'>&quot;,</span>\n",
              "&quot;country&quot;:<span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'> &quot;</span><span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'>Argentina</span><span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'>&quot;,</span>\n",
              "&quot;position&quot;:<span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'> </span><span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'>Stri</span>ker,\n",
              "&quot;stats&quot;: {\n",
              "         &quot;goals<span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'>&quot;:</span><span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'>1</span><span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'>0</span>,\n",
              "    &quot;assists&quot;:<span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'> </span><span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'>1</span><span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'>0</span>,\n",
              "    &quot;height&quot;:<span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'> </span><span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'>1</span><span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'>.</span><span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'>7</span><span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'>5</span>,\n",
              "    &quot;world-cup&quot;:<span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'> </span><span style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' title='1.0'>Yes</span>,\n",
              "}}</pre>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "# `commit_point`s are just ways of stopping functions once you hit a point.\n",
        "# For eg: commit_point(\",\") stops string_exp() once you hit `,`.\n",
        "@guidance(stateless=True)\n",
        "def simple_json(lm):\n",
        "    lm += ('{\\n' +\n",
        "     '\"name\": ' + string_exp() + commit_point(',') + '\\n'\n",
        "     '\"country\": ' + string_exp() + commit_point(',') + '\\n'\n",
        "     '\"position\": ' + position() + commit_point(',') + '\\n'\n",
        "     '\"stats\": {\\n' +\n",
        "     '         \"goals\":'+ number() + commit_point(',') + '\\n'\n",
        "     '         \"assists\": ' + number() + commit_point(',') + '\\n'\n",
        "     '         \"height\": ' + number() +'.' + number() + commit_point(',') + '\\n'\n",
        "     '         \"world-cup\": ' + world_cup() + commit_point(',') + '\\n'\n",
        "     + commit_point('}')\n",
        "     + commit_point('}'))\n",
        "    return lm\n",
        "\n",
        "# Initialize the query.\n",
        "lm = gemma2 + \"\"\"Using JSON, describe these Football players:\n",
        "Lionel Messi\n",
        "\"\"\"\n",
        "\n",
        "# Call the simple_json function and implement the JSON structure.\n",
        "lm += simple_json()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Zi8O6AF4As4N"
      },
      "source": [
        "Congratulations! You've successfully implemented constrained generation with the Gemma 2 model using `llama.cpp` and `Guidance` in a Colab environment. You can now experiment with the model, update the grammar, and explore its capabilities."
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "name": "[Gemma_2]Constrained_generation.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
